Exploiting Data Skew for Improved Query Performance
Analytic queries enable sophisticated large-scale data analysis within many
commercial, scientific and medical domains today. Data skew is a ubiquitous
feature of these real-world domains. In a retail database, some products are
typically much more popular than others. In a text database, word frequencies
follow a Zipf distribution with a small number of very common words, and a long
tail of infrequent words. In a geographic database, some regions have much
higher populations (and data measurements) than others. Current systems do not
make the most of caches for exploiting skew. In particular, a whole cache line
may remain cache resident even though only a small part of the cache line
corresponds to a popular data item. In this paper, we propose a novel index
structure for repositioning data items to concentrate popular items into the
same cache lines. The net result is better spatial locality, and better
utilization of limited cache resources. We develop a theoretical model for
analyzing the cache behavior, and implement database operators that are
efficient in the presence of skew. Our experiments on real and synthetic data
show that exploiting skew can significantly improve in-memory query
performance. In some cases, our techniques can speed up queries by over an
order of magnitude.
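The effect of concentrating popular items into the same cache lines can be illustrated with a small back-of-the-envelope model. The sketch below is not the index structure proposed in the paper. It assumes Zipf-distributed access probabilities, 8 items per cache line, and a hypothetical round-robin layout as the unclustered baseline, and it counts how many cache lines must stay resident to serve 80% of accesses under each layout.

```python
def zipf_weights(n, s=1.0):
    """Access probability of each item, ordered from most to least popular."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def lines_needed(layout, items_per_line, coverage):
    """Fewest cache lines whose combined access probability reaches `coverage`.

    `layout[p]` is the access probability of the item stored at position p.
    """
    line_weights = [
        sum(layout[start:start + items_per_line])
        for start in range(0, len(layout), items_per_line)
    ]
    line_weights.sort(reverse=True)  # keep the hottest lines resident first
    covered = 0.0
    for count, weight in enumerate(line_weights, start=1):
        covered += weight
        if covered >= coverage:
            return count
    return len(line_weights)


ITEMS = 4096      # assumed number of data items
PER_LINE = 8      # assumed: 64-byte cache line holding 8-byte items
num_lines = ITEMS // PER_LINE

weights = zipf_weights(ITEMS)

# Clustered layout: items stored in popularity order, so hot items share lines.
clustered = list(weights)

# Scattered layout (hypothetical baseline): hot items dealt round-robin,
# so each of the most popular items lands in a different cache line.
scattered = [0.0] * ITEMS
for rank, w in enumerate(weights):
    scattered[(rank % num_lines) * PER_LINE + rank // num_lines] = w

clustered_lines = lines_needed(clustered, PER_LINE, 0.8)
scattered_lines = lines_needed(scattered, PER_LINE, 0.8)
print(clustered_lines, scattered_lines)
```

Storing items in popularity order is the best case for this model: the k hottest lines then hold exactly the k * PER_LINE most popular items, so no other layout can cover the same share of accesses with fewer lines.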
Conditionally Risk-Averse Contextual Bandits
Contextual bandits with average-case statistical guarantees are inadequate in
risk-averse situations because they might trade off degraded worst-case
behaviour for better average performance. Designing a risk-averse contextual
bandit is challenging because exploration is necessary but risk-aversion is
sensitive to the entire distribution of rewards; nonetheless we exhibit the
first risk-averse contextual bandit algorithm with an online regret guarantee.
We conduct experiments on diverse scenarios where worst-case outcomes should
be avoided, ranging from dynamic pricing and inventory management to self-tuning
software, including a production exascale data processing system.
Optimizing Query Processing Under Skew
Big data systems such as relational databases, data science platforms, and scientific workflows all process queries over large and complex datasets. Skew is common in these real-world datasets and workloads. Different types of skew can have different impacts on the performance of query processing. Although skew sometimes causes load imbalance in a parallel execution environment, negatively impacting query performance, we demonstrate in this thesis that in many cases we can actually improve query performance in the presence of skew. To optimize query processing under skew, we develop a set of techniques to exploit the positive effects of skew and to avoid the negative effects. In order to exploit skew, we propose techniques including: (a) intentionally creating skew and clustering data in a distributed database system; (b) optimizing data layout for better caching in main-memory databases; and (c) adaptive execution techniques that are responsive to the underlying data in the context of compilers. In order to ameliorate skew, we study optimized hash-based partitioning that alleviates outliers in a genomic data context, as well as parallel prefix sum algorithms that are used to develop skew-insensitive algorithms. We evaluate the effectiveness of our techniques over synthetic data, standard benchmarks, and empirical datasets, and show that the performance of query processing under skew can be greatly improved. Overall, this thesis makes a concrete contribution to skew-related query processing.
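One common way prefix sums support partitioning under skew is by turning per-partition counts into write offsets, so that every partition gets a correctly sized output region no matter how uneven the counts are. The sketch below is a sequential illustration of that general idea, not the thesis's algorithm; the function names and the `part_of` partitioning function are hypothetical.

```python
from itertools import accumulate


def partition_offsets(counts):
    """Exclusive prefix sum: offsets[i] == sum(counts[:i])."""
    return list(accumulate(counts, initial=0))[:-1]


def scatter(keys, num_partitions, part_of):
    """Group keys by partition in one output array using count + prefix sum."""
    counts = [0] * num_partitions
    for key in keys:
        counts[part_of(key)] += 1

    offsets = partition_offsets(counts)
    cursor = list(offsets)          # next free slot in each partition
    out = [None] * len(keys)
    for key in keys:
        p = part_of(key)
        out[cursor[p]] = key
        cursor[p] += 1
    return out, offsets


# Skewed toy input: the odd partition receives five of the six keys.
skewed_keys = [5, 1, 5, 5, 2, 5]
grouped, starts = scatter(skewed_keys, 2, lambda key: key % 2)
print(grouped, starts)
```

In a parallel setting the counting and scanning steps would themselves be parallelized; this version only shows how the offsets are derived.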
Evaluating multi-way joins over discounted hitting time
The prevalence of graphs in emerging applications has recently attracted considerable research interest. To acquire interesting information hidden in large graphs, tasks including link prediction, collaborative recommendation, and reputation ranking all make use of proximities between graph nodes. The discounted hitting time (DHT), which is a random-walk similarity measure for graph node pairs, has been shown to be useful in various applications. In this thesis, we examine a novel query, called the multi-way join (or n-way join), over DHT scores. Given a graph and n sets of nodes, the n-way join retrieves a ranked list of n-tuples with the k highest scores, according to some aggregation function of DHT values. By extracting such top-k results, this query enables the analysis and prediction of various complex relationships among n sets of nodes on a large graph.
Since an n-way join is expensive to evaluate, we develop the Partial Join algorithm (or PJ). This solution decomposes an n-way join into a number of top-m 2-way joins, and combines their results to construct the answer of the n-way join. Since the process of PJ may necessitate the computation of top-(m + 1) 2-way joins, we study an incremental solution, which avoids recomputation and allows the results of the top-(m + 1) 2-way join to be derived quickly from the top-m 2-way join results computed earlier. For better performance, we further examine efficient processing algorithms and pruning techniques for 2-way joins. Through extensive experiments on three real graph datasets, we show that the proposed PJ algorithm accurately evaluates n-way joins, and is four orders of magnitude faster than basic solutions.
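To make the query itself concrete, the following sketch shows an exhaustive baseline for a top-k n-way join. It is not the PJ algorithm and does not compute DHT. It assumes that pairwise scores are supplied by a hypothetical `score(a, b)` function and that a tuple's score aggregates the scores of consecutive node pairs (a chain-shaped join), which is only one possible choice of aggregation.

```python
import heapq
from itertools import product


def n_way_join_bruteforce(node_sets, score, k, aggregate=sum):
    """Return the k highest-scoring n-tuples, taking one node from each set.

    Assumption: a tuple's score aggregates the pairwise scores of
    consecutive nodes, i.e. (v1, v2), (v2, v3), and so on.
    """
    def tuple_score(nodes):
        return aggregate(score(a, b) for a, b in zip(nodes, nodes[1:]))

    # Enumerates every combination, so cost grows with the product of set sizes.
    return heapq.nlargest(k, product(*node_sets), key=tuple_score)


# Toy pairwise scores standing in for DHT values (made-up numbers).
toy_scores = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.5, ("a2", "b2"): 0.1,
    ("b1", "c1"): 0.3, ("b2", "c1"): 0.8,
}
top2 = n_way_join_bruteforce(
    [["a1", "a2"], ["b1", "b2"], ["c1"]],
    lambda a, b: toy_scores[(a, b)],
    k=2,
)
print(top2)
```

Because every combination is scored, this baseline becomes impractical quickly as the node sets grow, which is the cost that decomposing the query into 2-way joins is meant to avoid.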
Evaluating Multi-Way Joins over Discounted Hitting Time
Abstract: The discounted hitting time (DHT), which is a random-walk similarity measure for graph node pairs, is useful in various applications, including link prediction, collaborative recommendation, and reputation ranking. We examine a novel query, called the multi-way join (or n-way join), on DHT scores. Given a graph and n sets of nodes, the n-way join retrieves a set of n-tuples with the k highest scores, according to some aggregation function of DHT values. This query enables analysis and prediction of complex relationships among n sets of nodes. Since an n-way join is expensive to compute, we develop the Partial Join algorithm (or PJ). This solution decomposes an n-way join into a number of top-m 2-way joins, and combines their results to construct the answer of the n-way join. Since PJ may necessitate the computation of top-(m + 1) 2-way joins, we study an incremental solution, which allows the top-(m + 1) 2-way join to be derived quickly from the top-m 2-way join results computed earlier. We further examine fast processing and pruning algorithms for 2-way joins. An extensive evaluation on three real datasets shows that PJ accurately evaluates n-way joins, and is four orders of magnitude faster than basic solutions.
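The idea of extending a top-m 2-way join to top-(m + 1) without starting over can be sketched with a lazy, heap-backed enumerator. This is a simplified illustration under stated assumptions, not the paper's incremental algorithm: it scores all pairs up front through a hypothetical `score(a, b)` function and applies no pruning.

```python
import heapq


def ranked_pairs(set_a, set_b, score):
    """Lazily yield (score, a, b) triples in descending score order."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-score(a, b), a, b) for a in set_a for b in set_b]
    heapq.heapify(heap)
    while heap:
        negated, a, b = heapq.heappop(heap)
        yield -negated, a, b


# Toy pairwise scores standing in for DHT values (made-up numbers).
pair_scores = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.5, ("a2", "b2"): 0.1,
}
stream = ranked_pairs(["a1", "a2"], ["b1", "b2"], lambda a, b: pair_scores[(a, b)])

top_m = [next(stream) for _ in range(2)]   # top-2 result
next_best = next(stream)                   # one more pop extends it to top-3
print(top_m, next_best)
```

The generator keeps its heap between calls, so asking for one more result costs a single pop instead of a fresh top-(m + 1) computation.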
Supplementary Online Material
NiO thickness dependence of the exchange bias field, the derivation of the NiO thickness dependence of SMR, MR measurement for the control sample, and harmonic Hall measurement.
Realization of Multi‐Level State and Artificial Synapses Function in Stacked (Ta/CoFeB/MgO)N Structures
Spintronic devices can realize multi‐state storage and be used to simulate artificial synapses or artificial neurons, which gives them promising application prospects in the field of artificial neural networks (ANN). This work investigates the current‐induced magnetization reversal in stacked (Ta/CoFeB/MgO)N structures and their application in ANN. It is demonstrated that complete current‐induced magnetization reversal with a large intermediate transition region can be achieved in the sample with N = 2. Magneto‐optical Kerr microscope imaging shows that the large transition region for the sample is ascribed to “layer‐by‐layer” reversal, owing to the difference in coercivity between the two CoFeB layers. In addition, the simulation of artificial synapse and artificial neuron functions based on current‐induced magnetization reversal in the sample is also demonstrated. These results substantiate the stacked (Ta/CoFeB/MgO)N structures as a promising platform for realizing multi‐level states and artificial synapse functions, with potential application in the field of ANN.
Predictive deployment of UAV base stations in wireless networks: machine learning meets contract theory
Abstract
In this paper, a novel framework is proposed to enable a predictive deployment of unmanned aerial vehicles (UAVs) as temporary base stations (BSs) to complement ground cellular systems in the face of downlink traffic overload. First, a novel learning approach, based on the weighted expectation maximization (WEM) algorithm, is proposed to estimate the user distribution and the downlink traffic demand. Next, to guarantee a truthful information exchange between the BS and UAVs, using the framework of contract theory, an offload contract is developed, and the sufficient and necessary conditions for having a feasible contract are analytically derived. Subsequently, an optimization problem is formulated to deploy an optimal UAV onto the hotspot area in a way that the utility of the overloaded BS is maximized. Simulation results show that the proposed WEM approach yields a prediction error of around 10%. Compared with the expectation maximization and k-means approaches, the WEM method shows a significant advantage in prediction accuracy as the traffic load in the cellular system becomes spatially uneven. Furthermore, compared with two event-driven deployment schemes based on the closest-distance and maximal-energy metrics, the proposed predictive approach enables UAV operators to provide efficient communication service for hotspot users in terms of the downlink capacity, energy consumption and service delay. Simulation results also show that the proposed method significantly improves the revenues of both the BS and UAV networks, compared with two baseline schemes.